R ships with many built-in data sets, such as mtcars. The analysis below instead loads a local file, data/ProcessedData.csv, which contains sales transactions together with the daily maximum temperature and rainfall.
Read about the mtcars data set.
In the rmd file, you will see how you can load your own dataset from either 1) an online source using a URL or 2) a local file on your own computer.
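As a rough sketch of the two loading options mentioned above, `read.csv()` accepts either a local file path or a URL. The file written here is a throwaway example created only so the snippet runs on its own, and the URL in the comment is a placeholder, not a real data source.

```r
# Write a tiny example CSV to a temporary file (stand-in for your own local file)
example_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Total = c(27, 24, 47), Quantity = c(1, 2, 3)),
          example_path, row.names = FALSE)

# 2) Load from a local file on your own computer
local_data <- read.csv(example_path)
head(local_data)

# 1) Loading from an online source works the same way, with a URL instead of a path.
# The address below is hypothetical:
# online_data <- read.csv("https://example.com/your-data.csv")
```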
# LOAD DATA
library(ggplot2)
library(tidyr)
data = read.csv("data/ProcessedData.csv")
# Quick look at the first 6 rows of data (the default for head())
head(data)
## X Date Receipt.Number Quantity Subtotal Sales.Tax Total Paid
## 1 0 2018-08-06 18179 1 24.55 2.45 27.00 NA
## 2 4 2018-08-06 18178 2 21.82 2.18 24.00 NA
## 3 7 2018-08-06 18177 3 42.73 4.27 47.00 NA
## 4 11 2018-08-06 18176 1 24.55 2.45 27.00 NA
## 5 16 2018-08-06 18175 5 28.23 2.82 31.05 NA
## 6 21 2018-08-06 18174 1 30.90 3.09 33.99 NA
## Details
## 1 1 X Cirillo Rose
## 2 2 X Fever Tree Elderflower Tonic 4pk
## 3 2 X Ps40 Smoked Lemonade + 1 X Athletes of Wine Vino Athletico Macedon Pinot noir
## 4 1 X Empty Wine Bottle 750ml + 1 X Unico Zelo Harvest Sauvignon Blanc KEG + -1 X Discount
## 5 3 X Frenchies Kolsch 330ml + 3 X Frenchies Comet Pale Ale 330ml + -1 X Discount
## 6 1 X Domaine Thomson - Explorer Pinot Noir
## Time Maximum.temperature..Degree.C. Rainfall.amount..millimetres.
## 1 18:42:40 19 0
## 2 18:03:54 19 0
## 3 17:45:58 19 0
## 4 17:32:56 19 0
## 5 16:26:31 19 0
## 6 15:05:09 19 0
## Size of data
# This dataset has 11870 rows (one per transaction) and 12 variables.
dim(data)
## [1] 11870 12
## R's classification of data
class(data)
## [1] "data.frame"
## R's classification of variables
str(data)
## 'data.frame': 11870 obs. of 12 variables:
## $ X : int 0 4 7 11 16 21 24 34 37 40 ...
## $ Date : Factor w/ 404 levels "2017-06-24","2017-06-25",..: 404 404 404 404 404 404 404 404 404 404 ...
## $ Receipt.Number : Factor w/ 11870 levels "10000","10001",..: 8108 8107 8106 8105 8104 8103 8102 8101 8100 8079 ...
## $ Quantity : num 1 2 3 1 5 1 5 1 1 3 ...
## $ Subtotal : num 24.6 21.8 42.7 24.6 28.2 ...
## $ Sales.Tax : num 2.45 2.18 4.27 2.45 2.82 3.09 2.82 3.27 1.82 2.49 ...
## $ Total : num 27 24 47 27 31.1 ...
## $ Paid : logi NA NA NA NA NA NA ...
## $ Details : Factor w/ 6558 levels "-1 X Adelaide Hills Distillery Dry Vermouth",..: 1075 5396 5594 1609 5787 1435 2251 1095 4873 3352 ...
## $ Time : Factor w/ 9974 levels "00:28:37","01:26:38",..: 8960 8071 7704 7424 6144 4551 3576 3415 3006 2451 ...
## $ Maximum.temperature..Degree.C.: num 19 19 19 19 19 19 19 19 19 19 ...
## $ Rainfall.amount..millimetres. : num 0 0 0 0 0 0 0 0 0 0 ...
#sapply(data, class)
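The `str()` output lists `Date` as a factor, which makes date arithmetic and grouping by month awkward. One possible way to convert it, shown on a small stand-in data frame rather than on the real data (the name `toy_dates` is made up for this illustration):

```r
# Stand-in for the Date column of the sales data, stored as a factor
toy_dates <- data.frame(Date = factor(c("2018-08-06", "2018-08-05", "2017-06-24")))

# Convert the factor to character first, then to the Date class
toy_dates$Date <- as.Date(as.character(toy_dates$Date), format = "%Y-%m-%d")

# Once converted, parts of the date are easy to extract, e.g. the month number
toy_dates$Month <- as.integer(format(toy_dates$Date, "%m"))

str(toy_dates)
```

Converting through `as.character()` avoids accidentally working with the factor's internal integer codes.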
Summary:
Complexity of data: We are looking at a dataset with 12 variables and 11870 observations.
How does the maximum temperature affect the consumer decision when purchasing alcohol?
#The temperature is a quantitative variable. We start by changing it to a qualitative one using ranges that each cover 5 degrees Celsius
temp = data$Maximum.temperature..Degree.C.
data$tempGroups = cut(temp, c(10,15,20,25,30,35,40,45))
#Transaction sizes for each temperature range
meanPerPerson = aggregate(data$Total ~ data$tempGroups, data, mean)
medPerPerson = aggregate(data$Total ~ data$tempGroups, data, median)
transactions = merge(x = meanPerPerson, y = medPerPerson, by='data$tempGroups')
names(transactions) = c('Temperature', 'Mean_total', 'Median_total')
print(transactions)
## Temperature Mean_total Median_total
## 1 (10,15] 43.73074 30.0
## 2 (15,20] 54.03240 34.0
## 3 (20,25] 55.60269 34.5
## 4 (25,30] 54.05537 34.5
## 5 (30,35] 50.16873 33.0
## 6 (35,40] 62.94938 38.4
## 7 (40,45] 32.85000 27.0
ggplot(transactions, aes(Temperature, Mean_total)) + geom_bar(stat="identity", position = "dodge")
ggplot(transactions, aes(Temperature, Median_total)) + geom_bar(stat="identity", position = "dodge")
#Number of transactions per temperature range
barplot(table(data$tempGroups))
#Total money spent, and mean spent per day, for each temperature range
totalPerDay = aggregate(data$Total ~ data$tempGroups, data, sum)
nrOfDaysPerTemp = aggregate(data$Date ~ data$tempGroups, data, function(x) length(unique(x)))
totals = merge(x = totalPerDay, y = nrOfDaysPerTemp, by='data$tempGroups')
names(totals) = c('Temperature', 'Total', 'NrOfDays')
totals['meanPerDay'] = round(totals$Total / totals$NrOfDays, 1)
ggplot(totals, aes(Temperature, meanPerDay)) + geom_bar(stat="identity", position = "dodge")
Summary: Looking at the median and mean transaction totals, we see little change in consumer behaviour over the temperature range 15-35 degrees. The more extreme temperatures have a larger effect. During the coldest days (10-15 degrees) there is a definite drop in the amount of money spent per purchase. During the very hottest days (40-45 degrees) there is also a large drop in the amount spent per transaction, although it is worth noting that there were very few transactions in that range. An interesting spike in transaction size occurred in the 35-40 degree range. This could be because people drink more alcohol in hot weather; however, such temperatures are typical around Christmas, when people are on vacation and drink more alcohol in general, so the time of year may also explain the spike.
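The summary above cautions that the hottest group contains very few transactions. A minimal sketch of how the group sizes could be reported next to the mean and median, so that small groups are easy to spot. The data frame is invented for illustration, and the column names `temp` and `total` are assumptions, not the names used in the sales data:

```r
# Invented example data: maximum temperature and transaction total
toy <- data.frame(temp  = c(12, 14, 18, 19, 22, 41),
                  total = c(30, 20, 40, 60, 50, 27))

# Bin temperature into 5-degree ranges; intervals are open on the left, closed on the right
toy$tempGroup <- cut(toy$temp, breaks = seq(10, 45, by = 5))

# Mean, median and number of transactions per range, in one call
stats <- aggregate(total ~ tempGroup, data = toy,
                   FUN = function(x) c(mean = mean(x), median = median(x), n = length(x)))

# aggregate() stores the three statistics as a matrix column; flatten it into plain columns
stats <- do.call(data.frame, stats)
print(stats)
```

A group with a very small `total.n` should be read with caution, since a single large or small purchase can dominate its mean.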
How does rainfall affect the consumer decision when purchasing alcohol?
There are a few days for which we do not have rain data, so we start by removing those rows. We then take a closer look at the rainfall data.
rainData = data %>% drop_na(Rainfall.amount..millimetres.)
print(paste('Number of rows with missing values for rainfall: ', nrow(data)-nrow(rainData)))
## [1] "Number of rows with missing values for rainfall: 74"
rain = rainData$Rainfall.amount..millimetres.
print(paste('Maximum rainfall', max(rain)))
## [1] "Maximum rainfall 69.4"
print(paste('Minimum rainfall', min(rain)))
## [1] "Minimum rainfall 0"
print(paste('Mean rainfall', mean(rain)))
## [1] "Mean rainfall 1.66568328246863"
print(paste('Median rainfall', median(rain)))
## [1] "Median rainfall 0"
#We start by changing the rainfall from a quantitative variable to a qualitative one
#(note: a rainfall of exactly 0 falls outside (0,5] and becomes NA)
rainData$rainGroups = cut(rain, c(0,5,15,40,70))
#Transaction sizes for each rainfall range
meanPerPerson = aggregate(rainData$Total ~ rainData$rainGroups, rainData, mean)
medPerPerson = aggregate(rainData$Total ~ rainData$rainGroups, rainData, median)
transactions = merge(x = meanPerPerson, y = medPerPerson, by='rainData$rainGroups')
names(transactions) = c('Rainfall', 'Mean_total', 'Median_total')
print(transactions)
## Rainfall Mean_total Median_total
## 1 (0,5] 55.96415 35.990
## 2 (15,40] 57.72575 35.750
## 3 (40,70] 47.45550 39.745
## 4 (5,15] 54.24200 30.000
ggplot(transactions, aes(Rainfall, Mean_total)) + geom_bar(stat="identity", position = "dodge")
ggplot(transactions, aes(Rainfall, Median_total)) + geom_bar(stat="identity", position = "dodge")
#Number of transactions for each rainfall range
barplot(table(rainData$rainGroups))
#Total money spent, and mean spent per day, for each rainfall range
totalPerDay = aggregate(rainData$Total ~ rainData$rainGroups, rainData, sum)
nrOfDaysPerRain = aggregate(rainData$Date ~ rainData$rainGroups, rainData, function(x) length(unique(x)))
totals = merge(x = totalPerDay, y = nrOfDaysPerRain, by='rainData$rainGroups')
names(totals) = c('Rainfall', 'Total', 'NrOfDays')
totals['meanPerDay'] = round(totals$Total / totals$NrOfDays, 1)
ggplot(totals, aes(Rainfall, meanPerDay)) + geom_bar(stat="identity", position = "dodge")
Insert text and analysis.
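One detail of the rainfall binning is worth checking: with its default settings, `cut()` builds intervals that are open on the left, so a rainfall of exactly 0 is not in `(0,5]` and is turned into `NA`. Since the median rainfall is 0, this applies to at least half of the rows. A small sketch of the behaviour and of one possible alternative (`include.lowest = TRUE`), using made-up rainfall values:

```r
rain_example <- c(0, 2, 10, 50)   # made-up rainfall amounts in mm
breaks <- c(0, 5, 15, 40, 70)

# Default: 0 is not in (0,5], so it becomes NA
default_groups <- cut(rain_example, breaks)
print(default_groups)

# include.lowest = TRUE closes the first interval on the left, giving [0,5]
inclusive_groups <- cut(rain_example, breaks, include.lowest = TRUE)
print(inclusive_groups)
```

Whether dry days should share a group with light rain or form a group of their own is an analysis decision; the sketch only shows how each choice could be expressed.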
Summary:
How does the time of year affect the consumer decision when purchasing alcohol?
Other possible research questions: What time of day do people buy their alcohol?
Insert text and analysis.
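One possible starting point for the time-of-day question is to extract the hour from the `Time` column and count transactions per hour. The sketch below uses a few timestamps copied from the `head()` output above rather than the full data, and the variable names are chosen only for this illustration:

```r
# A few purchase times in the same HH:MM:SS format as the Time column
times <- c("18:42:40", "18:03:54", "17:45:58", "15:05:09")

# The first two characters are the hour of day
hour <- as.integer(substr(times, 1, 2))

# Number of transactions in each hour
per_hour <- table(hour)
print(per_hour)
```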
Summary:
TODO:
Insert text.
Style: APA
This quick reference guide will cover some basic RMarkdown for use in your projects.
Here is a basic list:
To do 1
To do 2
To do 3
Here is a simple table.
| Tables | Are | Cool |
|---|:---:|---:|
| col 3 is | right-aligned | $1600 |
| col 2 is | centered | $12 |
| zebra stripes | are neat | $1 |
Here is an image. It has not been resized in the rmd file, so it is shown at the true size of the original image. This image is sourced directly from an online URL.
To learn more about adding images directly from your own computer, see the comments in this rmd file.
Image source: https://petcube.com/blog/10-all-important-kitten-supplies-infographic/
Below you will find a video embedded into your RMarkdown file. Change the YouTube link in the rmd file to get a different video.
You can even use LaTeX in an RMarkdown document!
For example, how could you work out \(\sum_{i=1}^{5} x_{i}^3\)?
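The text does not give values for the x_i, so as an assumption take x_i = i for i = 1, ..., 5. Under that assumption, the sum could be worked out in R like this:

```r
# Assumed example values x_1, ..., x_5 (not given in the text)
x <- c(1, 2, 3, 4, 5)

# Sum of cubes: 1 + 8 + 27 + 64 + 125
sum_of_cubes <- sum(x^3)
print(sum_of_cubes)  # 225
```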
Here is an R code chunk:
Try the following commands in R.
1+ exp(3) + sin(0.5)
x=c(1,2,3)
x^2
sum(x)
Here is some in-line code. You can put any R code here for display, e.g. sum(x).
Check out the resources below for more information on RMarkdown.